
Towards collaborative dialogue in Minecraft

This dissertation describes our work on building interactive agents that can communicate with humans to collaboratively solve tasks in grounded scenarios. To investigate the challenges of building such agents, we define a novel instantiation of a situated, Minecraft-based Collaborative Building Task, in which one player (A, the Architect) is shown a target structure, denoted Target, and must instruct the other player (B, the Builder) to build a copy of this structure, denoted Built, in a predefined build region. While both players can interact asynchronously via a chat interface, the roles are asymmetric: A can observe B and Target, but is invisible and cannot place blocks; B can freely place and remove blocks, but has no explicit knowledge of the target structure.

Each agent requires a different set of abilities to succeed at this task. A's main challenge is to generate situated instructions by comparing Built and Target, while B's responsibility is chiefly to comprehend A's situated instructions using both dialogue and world context. Both agents must interact asynchronously within an evolving dialogue context and a dynamic world state in which they are embodied.

In this work, we specifically examine how well end-to-end neural models can learn to be instruction givers (i.e., Architects) from a limited amount of real human-human data. To examine how humans complete the Collaborative Building Task, and to provide human-human data as a gold standard for training and evaluating models, we present the Minecraft Dialogue Corpus, a collection of 509 conversations and game logs.

We then introduce baseline models for the challenging subtask of Architect utterance generation and evaluate them offline, using both automated metrics and human evaluation. We show that while conditioning our model on a simple representation of the world improves its ability to generate correct instructions, clear shortcomings remain: it is difficult for these models to learn, in an entirely end-to-end manner, the wide variety of abilities needed to be a successful Architect. To address this, we show that including meaningful, structured information about the world and discourse state as additional inputs -- specifically, adding oracle information about the Builder's next actions and enriching our linguistic representation with Architect dialogue acts -- improves the performance of our utterance generation models. We also augment the data with shape information by pretraining 3D shape localization models on synthetically generated block configurations.

Finally, we integrate the Architect utterance generation models into actual Minecraft agents and evaluate them in a fully interactive setting.
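To make the Architect's core comparison concrete, below is a minimal sketch of the Built-versus-Target difference that situated instructions must close. This is an illustration, not code from the dissertation: the block representation (integer grid coordinates plus a color label) and the exact-coordinate matching are assumptions, and a literal set difference would not credit a copy built at a different offset within the region.

```python
from typing import NamedTuple, Set, Tuple

# A block is identified by its grid coordinates and color.
Block = Tuple[int, int, int, str]  # (x, y, z, color)

class Diff(NamedTuple):
    to_place: Set[Block]   # blocks in Target that are missing from Built
    to_remove: Set[Block]  # blocks in Built that do not belong to Target

def diff_structures(target: Set[Block], built: Set[Block]) -> Diff:
    """Block-level difference that the Architect's instructions must close."""
    return Diff(to_place=target - built, to_remove=built - target)

if __name__ == "__main__":
    target = {(0, 0, 0, "red"), (0, 1, 0, "red"), (1, 0, 0, "blue")}
    built = {(0, 0, 0, "red"), (2, 0, 0, "blue")}
    d = diff_structures(target, built)
    print(sorted(d.to_place))   # [(0, 1, 0, 'red'), (1, 0, 0, 'blue')]
    print(sorted(d.to_remove))  # [(2, 0, 0, 'blue')]
```

An Architect model that conditions on the world state is, in effect, learning to verbalize entries of such a diff in context; the dissertation's oracle inputs about the Builder's next actions can be read as hints about which entry to verbalize next.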
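The abstract also mentions pretraining 3D shape localization models on synthetically generated block configurations. Below is a hypothetical sketch of what such a synthetic generator could look like; the shape vocabulary (rows and towers), the region size, and the color set are illustrative assumptions, not the dissertation's actual setup.

```python
import random
from typing import List, Tuple

Block = Tuple[int, int, int, str]  # (x, y, z, color)

# Assumed color palette; the real task may use a different set.
COLORS = ["red", "blue", "green", "yellow", "purple", "orange"]

def row(length: int, origin: Tuple[int, int, int], color: str) -> List[Block]:
    """A straight row of same-colored blocks along the x axis."""
    x0, y0, z0 = origin
    return [(x0 + i, y0, z0, color) for i in range(length)]

def tower(height: int, origin: Tuple[int, int, int], color: str) -> List[Block]:
    """A vertical stack of same-colored blocks along the y axis."""
    x0, y0, z0 = origin
    return [(x0, y0 + i, z0, color) for i in range(height)]

def sample_labeled_shape(region: int = 11) -> Tuple[str, List[Block]]:
    """Sample one labeled shape at a random position inside the build region."""
    kind = random.choice(["row", "tower"])
    size = random.randint(2, 5)
    color = random.choice(COLORS)
    origin = (random.randrange(region - size + 1), 0, random.randrange(region))
    blocks = row(size, origin, color) if kind == "row" else tower(size, origin, color)
    return kind, blocks

if __name__ == "__main__":
    label, blocks = sample_labeled_shape()
    print(label, blocks)
```

Pairing each generated configuration with its shape label and position yields cheap supervision for a localization model, which can then be reused to ground shape words ("row", "tower") in the Architect's utterance generator.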